tl;dr

The goal of this project is to use statistical matching methods to search for a subset of Beta clients that is representative of Release. In this proof-of-concept, the specific use-case was to match clients on their performance, configuration, and environment covariates. Validation of the matching was performed on a hold-out set of Firefox user engagement covariates.

The following tables report the relative difference between Beta and Release in the mean and median, respectively, of each user engagement metric, for the training (v67) and validation (v68) data sets. These initial results are promising and suggest that such techniques could be applied to Mozilla use-cases.
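Throughout this report, "relative difference" is the absolute Beta-Release gap in a summary statistic, scaled by the Release value. A minimal sketch of the computation (assuming this formulation, which reproduces the table entries below):

```python
def relative_difference(beta_stat, release_stat):
    """Absolute Beta-Release gap, scaled by the Release value."""
    return abs(beta_stat - release_stat) / release_stat

# Matched-Beta vs. Release mean active_hours for v67 (values taken from the
# post-matching results reported later in this document).
print(round(relative_difference(0.7977679, 0.8498615), 7))  # 0.0612966
```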

Beta-Release Difference: Mean

metric                    pre-matching: v67  post-matching: v67  pre-matching: v68  post-matching: v68
active_hours              0.0308290          0.0612966           0.0606451          0.0938118
active_hours_max          0.0326158          0.0568527           0.0998606          0.0939657
uri_count                 0.0239820          0.0454368           0.0728483          0.0613485
uri_count_max             0.0345788          0.0384508           0.1234272          0.0814801
search_count              0.0326937          0.0176654           0.0480039          0.0135225
search_count_max          0.0370581          0.0029548           0.0946543          0.0229419
num_pages                 0.0087112          0.0648153           0.0829937          0.0418302
num_pages_max             0.0082139          0.0638618           0.0840384          0.0431463
daily_max_tabs            0.5308586          0.2885793           0.4287439          0.3055825
daily_max_tabs_max        0.4699185          0.2740081           0.3486020          0.2682372
daily_unique_domains      0.0152401          0.0523747           0.0082032          0.0099754
daily_unique_domains_max  0.0198211          0.0456359           0.0291493          0.0003550
daily_tabs_opened         0.1966846          0.0633665           0.1734935          0.0136131
daily_tabs_opened_max     0.1884580          0.0663631           0.1202410          0.0101047

Beta-Release Difference: Median

metric                    pre-matching: v67  post-matching: v67  pre-matching: v68  post-matching: v68
active_hours              0.0839799          0.1032947           0.1224242          0.1331136
active_hours_max          0.0826347          0.0886228           0.1720430          0.1338028
uri_count                 0.1046832          0.1100945           0.1718107          0.1317805
uri_count_max             0.1269036          0.1116751           0.2323232          0.1308017
search_count              0.0476190          0.0857143           0.2500000          0.0000000
search_count_max          0                  0                   0                  0
num_pages                 0.2401578          0.3017174           0.3676804          0.2223247
num_pages_max             0.2372583          0.2972759           0.3570644          0.2144493
daily_max_tabs            0.1333333          0.0285714           0.1000000          0.0322581
daily_max_tabs_max        0                  0                   0                  0
daily_unique_domains      0.0104167          0.0676294           0.0411921          0.0142857
daily_unique_domains_max  0.0833333          0.1515152           0.1333333          0.0769231
daily_tabs_opened         0.0000000          0.0740741           0.0285714          0.1000000
daily_tabs_opened_max     0.0000000          0.0588235           0.1176471          0.1000000

Problem Statement

Each new Release version of Firefox is available as a prerelease (Beta) before it is launched to the general user base. It is highly desirable to use telemetry from Beta versions of Firefox to anticipate critical aspects of Release behavior ahead of launch. However, it is well known that the Beta population has distinctly different characteristics than Release, such as the distribution of its users' countries of origin and a higher incidence of crashes. Therefore, directly using Beta telemetry to inform Release is not statistically valid.

One possible approach to this discrepancy is to use statistical matching techniques to find a subset of Beta that is representative of Release. What counts as “representative” depends on the desired use-case (outcome), such as performance characteristics or crash rates. In this work, we focus on user engagement metrics as the chosen use-case. Here, we follow two different strategies for validating the resulting model:

  1. Balance on the other covariates (e.g., environment and performance metrics), then look at the difference in the user engagement metrics between the balanced Beta and Release for that version (N). This gives us an idea of how clients with similar environments and performance resemble Release in terms of usage.
  2. Balance the Beta and Release data sets so they resemble each other across the covariates we are concerned with. Balancing, in this case, yields a set of client_ids for Beta that resembles Release. Our application then queries the next Beta version (N+1) for these client_ids and calculates the metrics we care about from the covariates we care about. This gives us an idea of how these users actually change over time.

Methodology

Our methodology first defines which aspect of Release behavior we will address with the Beta subset (e.g., start-up, user engagement, browser responsiveness), and then determines the statistical matching approach best suited to the chosen use-case. The methodology is summarized as follows:

  1. Build and prepare training and validation data sets containing Beta and Release clients (data preparation)
  2. Perform feature engineering on categorical data into dummy variables (feature engineering)
  3. Perform feature selection as an initial pre-filter to the covariates, to narrow the feature selection search space (feature selection)
  4. Perform statistical modeling for the chosen use-case, that is, user engagement metrics (modeling)

Data Preparation

The following filters are applied:

  • Desktop Firefox
  • Two weeks of collection per profile, starting with first observed ping within date window
  • en-US, en-GB locales
  • US, GB countries
                      v67        v68
rows                  302819     328042
columns               97         97
discrete_columns      8          8
continuous_columns    89         89
all_missing_columns   0          0
total_missing_values  0          0
complete_rows         302819     328042
total_observations    29373443   31820074
memory_usage          179294160  192914864

Covariates

The following covariates were collected and categorized as training or hold-out. The former were used to train the statistical matching model; the latter were excluded from training and used to assess model performance. The covariates are further subcategorized by what they measure.

Training

The following makes up the training data set, used in statistical matching:

  • Version 67
  • \(39\) covariates:
    • Environment
      • cpu_cores
      • cpu_cores_cat
      • cpu_speed_mhz
      • cpu_speed_cat
      • cpu_vendor
      • cpu_l2_cache_kb
      • cpu_l2_cache_kb_cat
      • memory_mb
      • memory_cat
      • os_version
      • is_wow64
      • distro_id_norm
      • install_year
    • Geo
      • country
      • timezone_offset
      • timezone_cat
      • locale
    • Settings
      • num_bookmarks
      • num_addons
      • sync_configured
      • fxa_configured
      • is_default_browser
      • default_search_engine
    • Page Load
      • FX_PAGE_LOAD_MS_2_PARENT
      • TIME_TO_DOM_COMPLETE_MS
      • TIME_TO_DOM_CONTENT_LOADED_END_MS
      • TIME_TO_LOAD_EVENT_END_MS
      • TIME_TO_DOM_INTERACTIVE_MS
      • TIME_TO_NON_BLANK_PAINT_MS
    • Startup
      • startup_ms
      • startup_ms_max
    • Stability
      • content_crashes
    • Frequency of Browser Usage
      • num_active_days
      • daily_num_sessions_started
      • daily_num_sessions_started_max
      • session_length
      • session_length_max
      • profile_age
      • profile_age_cat
  • Composition:

Holdout

The following constitutes the validation data set:

  • Version 68
  • \(14\) covariates:
    • User engagement
      • active_hours
      • active_hours_max
      • uri_count
      • uri_count_max
      • search_count
      • search_count_max
      • num_pages
      • num_pages_max
      • daily_max_tabs
      • daily_max_tabs_max
      • daily_unique_domains
      • daily_unique_domains_max
      • daily_tabs_opened
      • daily_tabs_opened_max
  • Composition:

Feature (Covariate) Engineering

In this step, we engineer the categorical features into dummy variables, in case some of them turn out to be determining factors when imputing other variables. Specifically, we employ two widely used techniques:

  • Binning is a common technique used to smooth noisy data by arranging numerical or categorical features into separate bins.
  • One-hot encoding is one of the most common encoding methods in machine learning, which spreads the values in a column to multiple flag columns and assigns 0 or 1 to them.
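As an illustrative sketch of the two techniques (assuming pandas, with hypothetical column values; the real pipeline produced the dummy columns reported below):

```python
import pandas as pd

df = pd.DataFrame({
    "cpu_vendor": ["Intel", "AMD", "Intel", "Other"],  # categorical feature
    "memory_mb": [2048, 4096, 8192, 16384],            # numeric feature
})

# Binning: arrange a numeric feature into labelled buckets.
df["memory_cat"] = pd.cut(df["memory_mb"],
                          bins=[0, 4096, 8192, float("inf")],
                          labels=["low", "mid", "high"])

# One-hot encoding: spread a column into multiple 0/1 flag columns.
encoded = pd.get_dummies(df, columns=["cpu_vendor", "memory_cat"])
print(sorted(c for c in encoded.columns if c.startswith("cpu_vendor_")))
# ['cpu_vendor_AMD', 'cpu_vendor_Intel', 'cpu_vendor_Other']
```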

Through these techniques, we end up with wider training and validation data sets. The following reports show the differences between those data sets before (df_train_f) and after (df_train_encoder) feature engineering.

Training (v67)

data.frame ncol nrow
pre-engineering df_train_f 63 302819
post-engineering df_train_encoder 97 302819
variable position class
52 V1 1 character
53 fxa_configured 29 logical
54 sync_configured 30 logical
55 is_default_browser 31 character
56 locale 32 character
57 normalized_channel 33 integer
58 default_search_engine 35 character
59 country 36 integer
60 cpu_vendor 42 integer
61 is_wow64 45 numeric
62 distro_id_norm 53 character
63 timezone_cat 54 character
64 cpu_l2_cache_kb_cat 59 character
65 label_beta 25 integer
66 label_release 26 integer
67 fxa_configured_False 29 integer
68 fxa_configured_True 30 integer
69 sync_configured_False 31 integer
70 sync_configured_True 32 integer
71 is_default_browser_False 33 integer
72 is_default_browser_True 34 integer
73 locale_en.GB 35 integer
74 locale_en.US 36 integer
75 normalized_channel_beta 37 integer
76 normalized_channel_release 38 integer
77 default_search_engine_Bing 40 integer
78 default_search_engine_DuckDuckGo 41 integer
79 default_search_engine_Google 42 integer
80 default_search_engine_other..bundled. 43 integer
81 default_search_engine_other..non.bundled. 44 integer
82 default_search_engine_Yahoo 45 integer
83 country_GB 46 integer
84 country_US 47 integer
85 cpu_vendor_AMD 53 integer
86 cpu_vendor_Intel 54 integer
87 cpu_vendor_Other 55 integer
88 is_wow64_False 58 integer
89 is_wow64_True 59 integer
90 distro_id_norm_acer 67 integer
91 distro_id_norm_Mozilla 68 integer
92 distro_id_norm_other 69 integer
93 distro_id_norm_Yahoo 70 integer
94 timezone_cat_..12..10. 71 integer
95 timezone_cat_..10..8. 72 integer
96 timezone_cat_..8..6. 73 integer
97 timezone_cat_..6..4. 74 integer
98 timezone_cat_..4..2. 75 integer
99 timezone_cat_..2.0. 76 integer
100 timezone_cat_.0.2. 77 integer
101 timezone_cat_.2.4. 78 integer
102 timezone_cat_.4.6. 79 integer
103 timezone_cat_.6.8. 80 integer
104 timezone_cat_.8.10. 81 integer
105 timezone_cat_.10.12. 82 integer
106 timezone_cat_.12.14. 83 integer
107 cpu_l2_cache_kb_cat_..1024 88 integer
108 cpu_l2_cache_kb_cat_..256 89 integer
109 cpu_l2_cache_kb_cat_..512 90 integer
110 cpu_l2_cache_kb_cat_..1024.1 91 integer
111 default_search_engine_missing 96 numeric

Validation (v68)

data.frame ncol nrow
pre-engineering df_validate_f 63 328042
post-engineering df_validate_encoder 97 328042
variable position class
52 V1 1 character
53 fxa_configured 29 logical
54 sync_configured 30 logical
55 is_default_browser 31 character
56 locale 32 character
57 normalized_channel 33 integer
58 default_search_engine 35 character
59 country 36 integer
60 cpu_vendor 42 integer
61 is_wow64 45 numeric
62 distro_id_norm 53 character
63 timezone_cat 54 character
64 cpu_l2_cache_kb_cat 59 character
65 label_beta 25 integer
66 label_release 26 integer
67 fxa_configured_False 29 integer
68 fxa_configured_True 30 integer
69 sync_configured_False 31 integer
70 sync_configured_True 32 integer
71 is_default_browser_False 33 integer
72 is_default_browser_True 34 integer
73 locale_en.GB 35 integer
74 locale_en.US 36 integer
75 normalized_channel_beta 37 integer
76 normalized_channel_release 38 integer
77 default_search_engine_Bing 40 integer
78 default_search_engine_DuckDuckGo 41 integer
79 default_search_engine_Google 42 integer
80 default_search_engine_missing 43 integer
81 default_search_engine_other..bundled. 44 integer
82 default_search_engine_other..non.bundled. 45 integer
83 default_search_engine_Yahoo 46 integer
84 country_GB 47 integer
85 country_US 48 integer
86 cpu_vendor_AMD 54 integer
87 cpu_vendor_Intel 55 integer
88 cpu_vendor_Other 56 integer
89 is_wow64_False 59 integer
90 is_wow64_True 60 integer
91 distro_id_norm_acer 68 integer
92 distro_id_norm_Mozilla 69 integer
93 distro_id_norm_other 70 integer
94 distro_id_norm_Yahoo 71 integer
95 timezone_cat_..12..10. 72 integer
96 timezone_cat_..10..8. 73 integer
97 timezone_cat_..8..6. 74 integer
98 timezone_cat_..6..4. 75 integer
99 timezone_cat_..4..2. 76 integer
100 timezone_cat_..2.0. 77 integer
101 timezone_cat_.0.2. 78 integer
102 timezone_cat_.2.4. 79 integer
103 timezone_cat_.4.6. 80 integer
104 timezone_cat_.6.8. 81 integer
105 timezone_cat_.8.10. 82 integer
106 timezone_cat_.10.12. 83 integer
107 timezone_cat_.12.14. 84 integer
108 cpu_l2_cache_kb_cat_..1024 89 integer
109 cpu_l2_cache_kb_cat_..256 90 integer
110 cpu_l2_cache_kb_cat_..512 91 integer
111 cpu_l2_cache_kb_cat_..1024.1 92 integer

Feature (Covariate) Selection

As statistical matching typically trains a machine learning (ML) model for calculation of propensity scores, variable selection should be employed. However, the literature suggests that typical ML techniques of limiting covariates to those that best predict the response are not helpful for statistical matching. Rather, covariates that are unrelated to the exposure (i.e., Beta or Release) but related to the outcome (i.e., user engagement metrics) should always be included in a propensity score model.

Therefore, in this step, we apply the Boruta algorithm as an initial pre-filter to the covariates, to narrow the feature selection search space. The Boruta algorithm is a wrapper around the random forest classification algorithm that tries to capture all the features in a data set that are important with respect to an outcome variable. In short, we apply the Boruta algorithm with each user engagement metric as the model outcome; for each outcome, the algorithm flags each feature as important or not. Finally, we take the top-5 and top-10 ranked features per metric and add them to candidate lists. We also perform the same process on a class-balanced data set.
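The analysis itself used the Boruta implementation for R; its core idea — compare each real feature's importance against "shadow" copies whose values have been permuted — can be sketched as follows (NumPy only, with a simple correlation-based importance standing in for the random-forest importance Boruta actually uses):

```python
import numpy as np

def boruta_style_filter(X, y, names, n_rounds=20, seed=0):
    """Keep features whose importance beats the best shadow (permuted) feature.

    Importance here is |Pearson correlation| with the outcome -- a simple
    stand-in for Boruta's random-forest importance, for illustration only.
    """
    rng = np.random.default_rng(seed)
    importance = lambda col: abs(np.corrcoef(col, y)[0, 1])
    real = np.array([importance(X[:, j]) for j in range(X.shape[1])])
    hits = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # shuffling breaks any feature/outcome link
        best_shadow = max(importance(shadows[:, j]) for j in range(X.shape[1]))
        hits += real > best_shadow         # count a "hit" for each winning feature
    # Confirm features that beat the best shadow in most rounds.
    return [n for n, h in zip(names, hits) if h > n_rounds // 2]

rng = np.random.default_rng(1)
signal = rng.normal(size=500)
noise = rng.normal(size=(500, 3))
X = np.column_stack([signal, noise])
y = signal * 2.0 + rng.normal(scale=0.1, size=500)
confirmed = boruta_style_filter(X, y, ["signal", "n1", "n2", "n3"])
print(confirmed)  # the informative "signal" column should be confirmed
```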

In total, then, we built four covariate sets, which we call experiments:

  • experiment 1:
    • daily_num_sessions_started
    • daily_num_sessions_started_max
    • FX_PAGE_LOAD_MS_2_PARENT
    • fxa_configured_True
    • memory_mb
    • num_active_days
    • num_addons
    • num_bookmarks
    • profile_age
    • session_length
    • session_length_max
    • TIME_TO_DOM_COMPLETE_MS
    • TIME_TO_DOM_CONTENT_LOADED_END_MS
    • TIME_TO_DOM_INTERACTIVE_MS
    • TIME_TO_LOAD_EVENT_END_MS
    • TIME_TO_NON_BLANK_PAINT_MS
    • timezone_cat_0_2
  • experiment 2:
    • country_US
    • daily_num_sessions_started
    • daily_num_sessions_started_max
    • default_search_engine_other_nonbundled
    • FX_PAGE_LOAD_MS_2_PARENT
    • fxa_configured_True
    • memory_mb
    • num_active_days
    • num_addons
    • num_bookmarks
    • profile_age
    • session_length
    • session_length_max
    • startup_ms
    • startup_ms_max
    • sync_configured_True
    • TIME_TO_DOM_COMPLETE_MS
    • TIME_TO_DOM_CONTENT_LOADED_END_MS
    • TIME_TO_DOM_INTERACTIVE_MS
    • TIME_TO_LOAD_EVENT_END_MS
    • TIME_TO_NON_BLANK_PAINT_MS
    • timezone_cat_0_2
  • experiment 3:
    • daily_num_sessions_started
    • daily_num_sessions_started_max
    • FX_PAGE_LOAD_MS_2_PARENT
    • memory_mb
    • num_active_days
    • num_addons
    • num_bookmarks
    • profile_age
    • session_length
    • session_length_max
    • TIME_TO_DOM_COMPLETE_MS
    • TIME_TO_DOM_CONTENT_LOADED_END_MS
    • TIME_TO_DOM_INTERACTIVE_MS
    • TIME_TO_LOAD_EVENT_END_MS
    • TIME_TO_NON_BLANK_PAINT_MS
  • experiment 4:
    • cpu_speed_mhz
    • daily_num_sessions_started
    • daily_num_sessions_started_max
    • default_search_engine_other_nonbundled
    • FX_PAGE_LOAD_MS_2_PARENT
    • memory_mb
    • num_active_days
    • num_addons
    • num_bookmarks
    • profile_age
    • session_length
    • session_length_max
    • startup_ms
    • startup_ms_max
    • TIME_TO_DOM_COMPLETE_MS
    • TIME_TO_DOM_CONTENT_LOADED_END_MS
    • TIME_TO_DOM_INTERACTIVE_MS
    • TIME_TO_LOAD_EVENT_END_MS
    • TIME_TO_NON_BLANK_PAINT_MS

Models

For the last step of our methodology, we use statistical matching methods to search for a subset of Beta clients that is representative of Release. To that end, a range of statistical matching models was reviewed, using the R library MatchIt:

  • Coarsened Exact Matching (CEM). In CEM, instead of matching based on a composite propensity to be in either treatment or control, one simply coarsens each variable to a reasonable degree of clustering for each variable and then performs an exact match on these coarsened variables.
  • Nearest neighbor matching. This method selects the \(r\) (default = 1) best control matches for each individual in the treatment group, by using a distance measure specified by the distance option (default = logit). In short, at each matching step the method chooses the control unit that is not yet matched but is closest to the treated unit on the distance measure.
  • Nearest neighbor matching, with Mahalanobis distance measure
  • Subclassification matching. The goal of subclassification is to form subclasses, such that in each the distribution of covariates for the Beta and Release user groups are as similar as possible.
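The models themselves were fit with MatchIt in R. The greedy nearest-neighbor step can be sketched in Python as follows (assuming propensity scores have already been estimated, e.g. by logistic regression; here Release would play the role of the treated group and Beta the control pool being subset, and the scores are hypothetical):

```python
def greedy_nearest_neighbor(treated_ps, control_ps):
    """1:1 greedy matching on propensity score.

    For each treated unit in turn, pick the closest control unit that has
    not yet been matched. Returns (treated_index, control_index) pairs.
    """
    available = list(range(len(control_ps)))
    pairs = []
    for i, ps in enumerate(treated_ps):
        j_best = min(available, key=lambda j: abs(control_ps[j] - ps))
        pairs.append((i, j_best))
        available.remove(j_best)  # each control is used at most once
    return pairs

# Toy example: 3 treated and 4 control propensity scores.
treated = [0.30, 0.70, 0.52]
control = [0.10, 0.33, 0.51, 0.68]
print(greedy_nearest_neighbor(treated, control))
# [(0, 1), (1, 3), (2, 2)]
```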

Before propensity scores are calculated, we define six covariate sets (experiments) to be used in model selection. The first four experiments were obtained under the conditions described above. One of the remaining experiments was built from a statistical test: we compute a normalized difference, a traditional statistical approach that calculates the difference between the control and treatment groups for every variable included in the selection model. Absolute scores higher than 25% are considered suspect and may indicate an imbalance for that variable; variables that create imbalance should be included in the selection model. For the last experiment, we considered all the variables present in the data set (except for user engagement).

  • experiment 5:
    • timezone_cat_10_12
    • timezone_cat_m10_m8
    • cpu_l2_cache_kb_cat_l512
    • num_pages_max
    • num_pages
    • profile_age
    • uri_count
    • timezone_cat_12_14
    • active_hours
    • uri_count_max
    • search_count
    • cpu_vendor_Intel
    • active_hours_max
    • daily_unique_domains
    • default_search_engine_other_bundled
    • daily_unique_domains_max
    • timezone_cat_m4_m2
    • cpu_vendor_AMD
    • default_search_engine_Bing
    • search_count_max
    • timezone_cat_m12_m10
    • timezone_cat_8_10
    • cpu_l2_cache_kb_cat_l1024
    • install_year
    • is_default_browser_True
    • cpu_speed_mhz
    • cpu_l2_cache_kb_cat_g1024
    • cpu_vendor_Other
    • default_search_engine_DuckDuckGo
    • cpu_l2_cache_kb_cat_l256
    • memory_mb
    • cpu_l2_cache_kb
    • startup_ms_max
    • default_search_engine_Yahoo
    • startup_ms
    • num_bookmarks
    • timezone_cat_m2_0
    • num_active_days
    • distro_id_norm_Yahoo
    • cpu_cores
    • default_search_engine_Google
    • daily_tabs_opened_max
    • daily_tabs_opened
    • timezone_cat_m8_m6
    • daily_max_tabs_max
    • distro_id_norm_other
    • daily_max_tabs
    • default_search_engine_other_nonbundled
    • distro_id_norm_acer
    • sync_configured_True
    • fxa_configured_True
    • daily_num_sessions_started
    • timezone_cat_6_8
    • timezone_cat_2_4
    • session_length_max
    • daily_num_sessions_started_max
    • TIME_TO_DOM_CONTENT_LOADED_END_MS
    • TIME_TO_NON_BLANK_PAINT_MS
    • locale_enUS
    • locale_enGB
    • distro_id_norm_Mozilla
    • FX_PAGE_LOAD_MS_2_PARENT
    • session_length
    • timezone_cat_4_6
    • timezone_cat_0_2
    • TIME_TO_DOM_INTERACTIVE_MS
    • TIME_TO_DOM_COMPLETE_MS
    • timezone_cat_m6_m4
    • TIME_TO_LOAD_EVENT_END_MS
    • country_US
    • timezone_offset
    • is_wow64_True
    • num_addons

Finally, a range of Beta overrepresentations were tested. In the following, 2x means there were twice as many Beta samples as Release.

1x Beta to Release (50% - 50%)

2x Beta to Release (70% - 30%)

4x Beta to Release (80% - 20%)
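A sketch of how such compositions can be drawn (hypothetical helper; the report does not show the actual sampling procedure):

```python
import random

def compose(beta_ids, release_ids, ratio, seed=0):
    """Draw `ratio` times as many Beta profiles as Release profiles."""
    rng = random.Random(seed)
    n_beta = min(len(beta_ids), ratio * len(release_ids))
    return rng.sample(beta_ids, n_beta), list(release_ids)

# 2x overrepresentation: twice as many Beta samples as Release.
beta, release = compose(list(range(1000)), list(range(100)), ratio=2)
print(len(beta), len(release))  # 200 100
```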

Results

The best-performing model was trained on the v67 data set and has the following properties:

  • Matching method: Nearest Neighbor
  • Beta oversampling: 1x Beta to Release
  • Covariates (Model Features): experiment 3

First Application - Training Set (V67)

In this application, we balance the two groups (Beta and Release) on the other covariates (e.g., environment and performance metrics) and then look at the difference in user engagement metrics between the balanced Beta and Release for that version (N). The utility of this application is to show how Beta differs from Release in user engagement, all other covariates being equal.

Holdout Covariates

  • active_hours
  • active_hours_max
  • uri_count
  • uri_count_max
  • search_count
  • search_count_max
  • num_pages
  • num_pages_max
  • daily_max_tabs
  • daily_max_tabs_max
  • daily_unique_domains
  • daily_unique_domains_max
  • daily_tabs_opened
  • daily_tabs_opened_max

Post-matching Beta-Release Difference

active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
beta (mean) 0.7977679 1.5379846 149.3878515 309.7738724 2.3311445 5.4187111 1.638074e+04 1.657372e+04 8.0837222 11.9707020 4.7234374 8.1824028 18.2089827 35.5746839
release (mean) 0.8498615 1.6306940 156.4986413 322.1612357 2.3730655 5.4347695 1.751605e+04 1.770435e+04 6.2733605 9.3960957 4.9844993 8.5736703 17.1239004 33.3607594
delta (mean) 0.0612966 0.0568527 0.0454368 0.0384508 0.0176654 0.0029548 6.481530e-02 6.386180e-02 0.2885793 0.2740081 0.0523747 0.0456359 0.0633665 0.0663631
beta (median) 0.5197569 1.0569444 86.1428571 175.0000000 0.8000000 2.0000000 3.846560e+03 3.998500e+03 3.8571429 6.0000000 3.3565341 5.0909091 8.3333333 16.0000000
release (median) 0.5796296 1.1597222 96.8000000 197.0000000 0.8750000 2.0000000 5.508600e+03 5.690000e+03 3.7500000 6.0000000 3.6000000 6.0000000 9.0000000 17.0000000
delta (median) 0.1032947 0.0886228 0.1100945 0.1116751 0.0857143 0.0000000 3.017174e-01 2.972759e-01 0.0285714 0.0000000 0.0676294 0.1515152 0.0740741 0.0588235
metric label active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
mean beta 0.8236611 1.577508 152.74550 311.0213 2.4506498 5.636171 17363.463 17558.93 9.603628 13.811495 5.060464 8.743610 20.491908 39.64786
mean beta - matched 0.7977679 1.537985 149.38785 309.7739 2.3311445 5.418711 16380.741 16573.72 8.083722 11.970702 4.723437 8.182403 18.208983 35.57468
mean release 0.8498615 1.630694 156.49864 322.1612 2.3730655 5.434769 17516.049 17704.35 6.273360 9.396096 4.984499 8.573670 17.123900 33.36076
median beta 0.5309524 1.063889 86.66667 172.0000 0.8333333 2.000000 4185.667 4340.00 4.250000 6.000000 3.562500 5.500000 9.000000 17.00000
median beta - matched 0.5197569 1.056944 86.14286 175.0000 0.8000000 2.000000 3846.560 3998.50 3.857143 6.000000 3.356534 5.090909 8.333333 16.00000
median release 0.5796296 1.159722 96.80000 197.0000 0.8750000 2.000000 5508.600 5690.00 3.750000 6.000000 3.600000 6.000000 9.000000 17.00000

Wilcoxon test

In short, we want to know if there is any significant difference between the average user engagement metrics in the Beta and Release groups, for the same version (v67). Here, we use the unpaired two-samples Wilcoxon test, which is a non-parametric alternative to the unpaired two-samples t-test used to compare two independent groups of samples. Our question is: Is there any significant difference between Beta (v67) and Release (v67) user engagement metrics?

If the resulting p-values are less than the significance level \(\alpha = 0.05\), we can conclude that Beta’s user engagement metrics are, on average, significantly different from those of Release users.
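The unpaired two-samples Wilcoxon test is also known as the Mann-Whitney U test. A sketch using SciPy (the original analysis would have used R's wilcox.test; the sample values are hypothetical):

```python
from scipy.stats import mannwhitneyu

def engagement_differs(beta_values, release_values, alpha=0.05):
    """Unpaired two-sample Wilcoxon (Mann-Whitney U) test at level alpha."""
    stat, p_value = mannwhitneyu(beta_values, release_values,
                                 alternative="two-sided")
    return p_value, p_value < alpha

beta = [0.2, 0.5, 0.1, 0.8, 0.3, 0.4]
release = [0.9, 1.1, 0.7, 1.3, 1.0, 0.6]
p, diff = engagement_differs(beta, release)
print(diff)  # True: these two samples clearly differ
```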

p_value diff
active_hours 0.0032274 TRUE
active_hours_max 0.0028748 TRUE
uri_count 0.0431074 TRUE
uri_count_max 0.0117049 TRUE
search_count 0.3103178 FALSE
search_count_max 0.2549964 FALSE
num_pages 0.0003315 TRUE
num_pages_max 0.0003186 TRUE
daily_max_tabs 0.2051756 FALSE
daily_max_tabs_max 0.1004788 FALSE
daily_unique_domains 0.0038884 TRUE
daily_unique_domains_max 0.0043621 TRUE
daily_tabs_opened 0.8696248 FALSE
daily_tabs_opened_max 0.4046741 FALSE

Visual inspection

For a graphical comparison, we plot the holdout covariate distributions for the following subsets:

  • Beta v67: pre-matching [blue]
  • Beta v67: matched and subsetted [pink]
  • Release v67 [purple]

NOTE: Guiding lines have been added for the following:

  • black solid: Release mean
  • black dashed: Release median
  • red dashed line: subsetted Beta mean.

Training Covariates

  • daily_num_sessions_started
  • daily_num_sessions_started_max
  • FX_PAGE_LOAD_MS_2_PARENT
  • memory_mb
  • num_active_days
  • num_addons
  • num_bookmarks
  • profile_age
  • session_length
  • session_length_max
  • TIME_TO_DOM_COMPLETE_MS
  • TIME_TO_DOM_CONTENT_LOADED_END_MS
  • TIME_TO_DOM_INTERACTIVE_MS
  • TIME_TO_LOAD_EVENT_END_MS
  • TIME_TO_NON_BLANK_PAINT_MS

Post-matching Beta-Release Difference

daily_num_sessions_started daily_num_sessions_started_max FX_PAGE_LOAD_MS_2_PARENT memory_mb num_active_days num_addons num_bookmarks profile_age session_length session_length_max TIME_TO_DOM_COMPLETE_MS TIME_TO_DOM_CONTENT_LOADED_END_MS TIME_TO_DOM_INTERACTIVE_MS TIME_TO_LOAD_EVENT_END_MS TIME_TO_NON_BLANK_PAINT_MS
beta (mean) 2.7321709 4.9926873 3247.1869234 9579.7998207 5.5523212 7.1386288 233.7638860 908.3743631 10.7072617 20.4610988 3851.5736062 2566.7619973 2102.7159586 3571.3891057 1657.5890160
release (mean) 2.8754903 5.2335855 3027.8108054 9447.2322438 5.5698425 5.6691919 160.4287272 896.5396045 9.3214724 18.2702112 3294.6494654 2296.3124719 1797.2162710 3019.4899187 1443.3046942
delta (mean) 0.0498417 0.0460293 0.0724537 0.0140324 0.0031457 0.2591969 0.4571199 0.0132005 0.1486664 0.1199158 0.1690390 0.1177756 0.1699849 0.1827789 0.1484678
beta (median) 1.8000000 3.0000000 2779.5305788 8058.0000000 6.0000000 6.0000000 24.9166667 721.0000000 6.4794677 12.2702780 2646.8827684 1717.7745048 1463.2633003 2452.7563807 1160.6414141
release (median) 2.0000000 4.0000000 2649.1832061 8071.0000000 6.0000000 5.0000000 26.0000000 704.0000000 6.4109720 11.7819440 2495.8947368 1628.9545455 1376.1397059 2298.4834123 1091.0504202
delta (median) 0.1000000 0.2500000 0.0492029 0.0016107 0.0000000 0.2000000 0.0416667 0.0241477 0.0106841 0.0414477 0.0604946 0.0545257 0.0633101 0.0671195 0.0637835
metric label daily_num_sessions_started daily_num_sessions_started_max FX_PAGE_LOAD_MS_2_PARENT memory_mb num_active_days num_addons num_bookmarks profile_age session_length session_length_max TIME_TO_DOM_COMPLETE_MS TIME_TO_DOM_CONTENT_LOADED_END_MS TIME_TO_DOM_INTERACTIVE_MS TIME_TO_LOAD_EVENT_END_MS TIME_TO_NON_BLANK_PAINT_MS
mean beta 2.368895 4.281399 3463.708 8965.156 5.346169 7.855376 242.48780 893.7534 12.296199 22.70657 4388.914 2737.638 2404.393 4126.912 1833.656
mean beta - matched 2.732171 4.992687 3247.187 9579.800 5.552321 7.138629 233.76389 908.3744 10.707262 20.46110 3851.574 2566.762 2102.716 3571.389 1657.589
mean release 2.875490 5.233586 3027.811 9447.232 5.569843 5.669192 160.42873 896.5396 9.321472 18.27021 3294.649 2296.312 1797.216 3019.490 1443.305
median beta 1.666667 3.000000 2952.174 8031.000 6.000000 7.000000 26.00000 711.0000 7.710555 14.80889 2918.205 1856.036 1618.024 2739.282 1251.720
median beta - matched 1.800000 3.000000 2779.531 8058.000 6.000000 6.000000 24.91667 721.0000 6.479468 12.27028 2646.883 1717.775 1463.263 2452.756 1160.641
median release 2.000000 4.000000 2649.183 8071.000 6.000000 5.000000 26.00000 704.0000 6.410972 11.78194 2495.895 1628.955 1376.140 2298.483 1091.050

Wilcoxon test

p_value diff
daily_num_sessions_started 0.0000026 TRUE
daily_num_sessions_started_max 0.0000010 TRUE
FX_PAGE_LOAD_MS_2_PARENT 0.0011319 TRUE
memory_mb 0.9551818 FALSE
num_active_days 0.8000686 FALSE
num_addons 0.0000000 TRUE
num_bookmarks 0.0075289 TRUE
profile_age 0.5966258 FALSE
session_length 0.0228950 TRUE
session_length_max 0.1650719 FALSE
TIME_TO_DOM_COMPLETE_MS 0.0004808 TRUE
TIME_TO_DOM_CONTENT_LOADED_END_MS 0.0354082 TRUE
TIME_TO_DOM_INTERACTIVE_MS 0.0002422 TRUE
TIME_TO_LOAD_EVENT_END_MS 0.0001210 TRUE
TIME_TO_NON_BLANK_PAINT_MS 0.0006373 TRUE

Visual inspection

NOTE: Guiding lines have been added for the following:

  • black solid: Release mean
  • black dashed: Release median
  • red dashed line: subsetted Beta mean

Discussion

Our main objective was to determine how Beta users differ from Release users in terms of user engagement, all other covariates being equal. From the prior analysis, we can see that there are significant differences between the two groups (Beta and Release) for some user engagement metrics, listed as follows.

  • num_pages
  • num_pages_max
  • active_hours
  • active_hours_max
  • uri_count
  • uri_count_max
  • daily_unique_domains
  • daily_unique_domains_max

In addition, by analyzing the distributions of the training covariates, we can see exactly which variables presented the biggest discrepancies, that is, in which aspects the matching was not able to balance the data sets effectively. The most discrepant covariates are listed as follows.

  • num_addons
  • daily_num_sessions_started_max
  • daily_num_sessions_started

Second Application - Validation Set (V68)

In this application, we balance the Beta and Release data sets so they resemble each other across the covariates we are concerned with, that is, the user engagement metrics. Balancing, in this case, yields a set of client_ids for Beta that resembles Release. This gives us an idea of how these users actually change over time. If we see changes that are larger than anticipated, then we know that something significant is happening in user engagement that we can “forecast” in the subsequent Release.

The next step is to subset the validation v68 set by these matched Beta profiles. This reduces the Beta sample size used in the subsequent analysis:

  • v67 Beta subset: 21196 distinct profiles
  • v68 Beta subset: 14168 distinct profiles
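Subsetting v68 by the matched profiles is a filter on client_id; the drop from 21196 to 14168 profiles reflects matched clients that produced no ping in the v68 window. A minimal sketch with hypothetical records:

```python
# Matched client_ids from the v67 model (hypothetical values).
matched_ids = {"client_a", "client_b", "client_d"}

# v68 Beta pings keyed by client_id (hypothetical records).
v68_beta = [
    {"client_id": "client_a", "active_hours": 0.9},
    {"client_id": "client_b", "active_hours": 0.4},
    {"client_id": "client_c", "active_hours": 1.2},  # not matched -> dropped
]

# client_d was matched in v67 but sent no v68 ping, so the subset shrinks.
v68_subset = [row for row in v68_beta if row["client_id"] in matched_ids]
print(len(v68_subset))  # 2 of the 3 v68 profiles survive the filter
```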

Holdout Covariates

  • active_hours
  • active_hours_max
  • uri_count
  • uri_count_max
  • search_count
  • search_count_max
  • num_pages
  • num_pages_max
  • daily_max_tabs
  • daily_max_tabs_max
  • daily_unique_domains
  • daily_unique_domains_max
  • daily_tabs_opened
  • daily_tabs_opened_max

Training and Validation Difference:

Mean

metric                    pre-matching  post-matching
active_hours              0.0606451     0.0938118
active_hours_max          0.0998606     0.0939657
uri_count                 0.0728483     0.0613485
uri_count_max             0.1234272     0.0814801
search_count              0.0480039     0.0135225
search_count_max          0.0946543     0.0229419
num_pages                 0.0829937     0.0418302
num_pages_max             0.0840384     0.0431463
daily_max_tabs            0.4287439     0.3055825
daily_max_tabs_max        0.3486020     0.2682372
daily_unique_domains      0.0082032     0.0099754
daily_unique_domains_max  0.0291493     0.0003550
daily_tabs_opened         0.1734935     0.0136131
daily_tabs_opened_max     0.1202410     0.0101047

Median

metric                    pre-matching  post-matching
active_hours              0.1224242     0.1331136
active_hours_max          0.1720430     0.1338028
uri_count                 0.1718107     0.1317805
uri_count_max             0.2323232     0.1308017
search_count              0.2500000     0.0000000
search_count_max          0             0
num_pages                 0.3676804     0.2223247
num_pages_max             0.3570644     0.2144493
daily_max_tabs            0.1000000     0.0322581
daily_max_tabs_max        0             0
daily_unique_domains      0.0411921     0.0142857
daily_unique_domains_max  0.1333333     0.0769231
daily_tabs_opened         0.0285714     0.1000000
daily_tabs_opened_max     0.1176471     0.1000000
metric label active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
mean beta 0.7988445 1.471189 146.33392 287.3003 2.324319 5.100206 15614.038 15779.79 9.019717 12.828446 5.148112 8.581990 20.03166 37.29553
mean beta - matched 0.8681008 1.667582 162.35200 334.8136 2.538838 5.859896 20533.298 20742.87 8.533187 12.490330 5.402676 9.384929 18.95001 37.62945
mean release 0.8504182 1.634401 157.83170 327.7541 2.441522 5.633435 17027.187 17227.57 6.313040 9.512403 5.106225 8.839660 17.07011 33.29242
median beta 0.5027778 0.962500 80.50000 152.0000 0.750000 2.000000 3347.500 3513.00 4.125000 6.000000 3.500000 5.200000 8.50000 15.00000
median beta - matched 0.5935764 1.195833 99.84524 206.0000 1.000000 3.000000 6640.125 6861.00 4.000000 6.000000 3.833333 6.000000 9.00000 18.00000
median release 0.5729167 1.162500 97.20000 198.0000 1.000000 2.000000 5294.000 5464.00 3.750000 6.000000 3.650366 6.000000 8.75000 17.00000
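
The relative differences in the tables above can be computed per covariate; the sketch below assumes the statistic is normalized by the Release value, i.e. |stat(Beta) − stat(Release)| / |stat(Release)|, which is an assumption about how the tables were produced:

```python
import pandas as pd

def relative_difference(beta, release, stat):
    """Per-covariate relative difference of a summary statistic:
    |stat(beta) - stat(release)| / |stat(release)|."""
    b, r = beta.agg(stat), release.agg(stat)
    return (b - r).abs() / r.abs()

# Toy data standing in for the real per-channel data sets.
beta = pd.DataFrame({"active_hours": [0.4, 0.6, 1.0], "uri_count": [80, 120, 100]})
release = pd.DataFrame({"active_hours": [0.5, 0.7, 0.9], "uri_count": [90, 110, 130]})

mean_diff = relative_difference(beta, release, "mean")
median_diff = relative_difference(beta, release, "median")
```

Running this once for the pre-matching sets and once for the matched subsets yields the two rows of each table.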

Wilcoxon test

Now, we want to know whether there is any significant difference between the average user engagement metrics of the Beta and Release groups across versions (v67 and v68). Once again, we use the Wilcoxon test to answer the following question: is there any significant difference between the matched Beta (v68) and Release (v68) user engagement metrics?

| covariate | p_value | diff |
|---|---|---|
| active_hours | 0.8058407 | FALSE |
| active_hours_max | 0.8665102 | FALSE |
| uri_count | 0.9320963 | FALSE |
| uri_count_max | 0.9368539 | FALSE |
| search_count | 0.7363237 | FALSE |
| search_count_max | 0.4779041 | FALSE |
| num_pages | 0.0040118 | TRUE |
| num_pages_max | 0.0040251 | TRUE |
| daily_max_tabs | 0.4330177 | FALSE |
| daily_max_tabs_max | 0.7022354 | FALSE |
| daily_unique_domains | 0.7043529 | FALSE |
| daily_unique_domains_max | 0.4844754 | FALSE |
| daily_tabs_opened | 0.6283246 | FALSE |
| daily_tabs_opened_max | 0.3686337 | FALSE |
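
For two independent samples, the test used here is the Wilcoxon rank-sum (Mann-Whitney U) test, available as scipy.stats.mannwhitneyu or R's wilcox.test. A stdlib-only sketch of the normal-approximation version (no tie correction in the variance, which is adequate for large samples like these):

```python
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Ties receive average ranks; the tie correction to the variance is omitted."""
    n1, n2 = len(x), len(y)
    pooled = sorted(x + y)
    # Assign each distinct value its average 1-based rank (handles ties).
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    w = sum(rank[v] for v in x)            # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12
    z = (w - mean_w) / var_w ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Flag a covariate as significantly different at the 5% level, as in the table.
p_value = rank_sum_test([1, 2, 3, 4], [5, 6, 7, 8])
diff = p_value < 0.05
```

Each row of the table is one such test, run per covariate on the matched-Beta and Release samples.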

Visual inspection

For a graphical comparison, we plot the covariate distributions for the following subsets:

  • Beta v68: pre-matching [blue]
  • Beta v68: matched and subsetted [pink]
  • Release v68 [purple]

NOTE: Guiding lines have been added for the following:

  • black solid: Release mean
  • black dashed: Release median
  • red dashed: subsetted Beta mean
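
One such panel can be sketched with matplotlib; the data below is synthetic and the covariate name is just an example, since the project's plotting code is not shown here:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for one covariate, e.g. active_hours.
beta_pre = rng.lognormal(0.0, 0.5, 2000)      # Beta v68: pre-matching [blue]
beta_matched = rng.lognormal(0.1, 0.5, 1000)  # Beta v68: matched and subsetted [pink]
release = rng.lognormal(0.1, 0.5, 5000)       # Release v68 [purple]

fig, ax = plt.subplots()
bins = np.linspace(0, 5, 60)
ax.hist(beta_pre, bins=bins, density=True, alpha=0.4, color="blue", label="Beta v68: pre-matching")
ax.hist(beta_matched, bins=bins, density=True, alpha=0.4, color="pink", label="Beta v68: matched")
ax.hist(release, bins=bins, density=True, alpha=0.4, color="purple", label="Release v68")

# Guiding lines, following the note above.
ax.axvline(release.mean(), color="black", linestyle="-", label="Release mean")
ax.axvline(np.median(release), color="black", linestyle="--", label="Release median")
ax.axvline(beta_matched.mean(), color="red", linestyle="--", label="subsetted Beta mean")
ax.set_xlabel("active_hours")
ax.legend()
fig.savefig("active_hours_v68.png")
```

One figure like this is produced per covariate.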

Training Covariates

  • daily_num_sessions_started
  • daily_num_sessions_started_max
  • FX_PAGE_LOAD_MS_2_PARENT
  • memory_mb
  • num_active_days
  • num_addons
  • num_bookmarks
  • profile_age
  • session_length
  • session_length_max
  • TIME_TO_DOM_COMPLETE_MS
  • TIME_TO_DOM_CONTENT_LOADED_END_MS
  • TIME_TO_DOM_INTERACTIVE_MS
  • TIME_TO_LOAD_EVENT_END_MS
  • TIME_TO_NON_BLANK_PAINT_MS

Training and Validation Difference:

Mean

| covariate | pre-matching | post-matching |
|---|---|---|
| daily_num_sessions_started | 0.1568175 | 0.1150352 |
| daily_num_sessions_started_max | 0.2044783 | 0.1219282 |
| FX_PAGE_LOAD_MS_2_PARENT | 0.2423907 | 0.0466569 |
| memory_mb | 0.0858950 | 0.0321503 |
| num_active_days | 0.1394702 | 0.0465495 |
| num_addons | 0.2138894 | 0.1070154 |
| num_bookmarks | 0.4308288 | 0.3912255 |
| profile_age | 0.0116675 | 0.0296786 |
| session_length | 0.2776138 | 0.2691643 |
| session_length_max | 0.2040924 | 0.2294827 |
| TIME_TO_DOM_COMPLETE_MS | 0.5155437 | 0.1287980 |
| TIME_TO_DOM_CONTENT_LOADED_END_MS | 0.3545873 | 0.1242510 |
| TIME_TO_DOM_INTERACTIVE_MS | 0.4943291 | 0.1430486 |
| TIME_TO_LOAD_EVENT_END_MS | 0.5457763 | 0.1370603 |
| TIME_TO_NON_BLANK_PAINT_MS | 0.3981456 | 0.1162488 |

Median

| covariate | pre-matching | post-matching |
|---|---|---|
| daily_num_sessions_started | 0.1666667 | 0.1538462 |
| daily_num_sessions_started_max | 0.25 | 0.25 |
| FX_PAGE_LOAD_MS_2_PARENT | 0.2079933 | 0.0399131 |
| memory_mb | 0.0122646 | 0.0000000 |
| num_active_days | 0.1666667 | 0.0000000 |
| num_addons | 0.2 | 0.2 |
| num_bookmarks | 0.1153846 | 0.0294118 |
| profile_age | 0.0237389 | 0.0515971 |
| session_length | 0.0635072 | 0.0757924 |
| session_length_max | 0.0239411 | 0.1015121 |
| TIME_TO_DOM_COMPLETE_MS | 0.2968198 | 0.0666033 |
| TIME_TO_DOM_CONTENT_LOADED_END_MS | 0.2541060 | 0.0665543 |
| TIME_TO_DOM_INTERACTIVE_MS | 0.2905375 | 0.0719666 |
| TIME_TO_LOAD_EVENT_END_MS | 0.3073948 | 0.0721201 |
| TIME_TO_NON_BLANK_PAINT_MS | 0.2432585 | 0.0627417 |

Summary statistics

| covariate | Beta mean | matched Beta mean | Release mean | Beta median | matched Beta median | Release median |
|---|---|---|---|---|---|---|
| daily_num_sessions_started | 2.398131 | 2.760920 | 2.844142 | 1.666667 | 1.833333 | 2.000000 |
| daily_num_sessions_started_max | 4.141417 | 5.067123 | 5.205914 | 3.000000 | 3.000000 | 4.000000 |
| FX_PAGE_LOAD_MS_2_PARENT | 3592.767 | 3007.539 | 2891.817 | 3044.818 | 2608.835 | 2520.559 |
| memory_mb | 8795.994 | 9891.627 | 9622.521 | 7973.000 | 8073.000 | 8072.000 |
| num_active_days | 4.912574 | 5.909091 | 5.708778 | 5.000000 | 6.000000 | 6.000000 |
| num_addons | 6.882667 | 6.471011 | 5.669929 | 6.000000 | 6.000000 | 5.000000 |
| num_bookmarks | 225.4153 | 249.5533 | 157.5418 | 23.0000 | 33.0000 | 26.0000 |
| profile_age | 875.2575 | 1024.7847 | 885.5902 | 690.0000 | 856.0000 | 674.0000 |
| session_length | 12.336721 | 11.449578 | 9.656064 | 7.154375 | 7.165972 | 6.727152 |
| session_length_max | 22.27297 | 21.60048 | 18.49772 | 12.96139 | 13.96167 | 12.65833 |
| TIME_TO_DOM_COMPLETE_MS | 4596.715 | 3393.468 | 3033.047 | 3024.414 | 2490.743 | 2332.177 |
| TIME_TO_DOM_CONTENT_LOADED_END_MS | 2896.937 | 2399.501 | 2138.613 | 2001.342 | 1713.148 | 1595.832 |
| TIME_TO_DOM_INTERACTIVE_MS | 2626.875 | 1961.143 | 1757.896 | 1764.381 | 1456.643 | 1367.168 |
| TIME_TO_LOAD_EVENT_END_MS | 4375.866 | 3160.932 | 2830.853 | 2856.994 | 2329.033 | 2185.257 |
| TIME_TO_NON_BLANK_PAINT_MS | 2014.849 | 1568.056 | 1441.087 | 1358.857 | 1145.001 | 1092.980 |

Wilcoxon test

Now, we want to know whether there is any significant difference between the average training covariates of the Beta and Release groups across versions (v67 and v68). Once again, we use the Wilcoxon test to answer the following question: is there any significant difference between the matched Beta (v68) and Release (v68) training covariates?

| covariate | p_value | diff |
|---|---|---|
| daily_num_sessions_started | 0.3099819 | FALSE |
| daily_num_sessions_started_max | 0.3237056 | FALSE |
| FX_PAGE_LOAD_MS_2_PARENT | 0.1381689 | FALSE |
| memory_mb | 0.0632427 | FALSE |
| num_active_days | 0.0082311 | TRUE |
| num_addons | 0.0000000 | TRUE |
| num_bookmarks | 0.0000032 | TRUE |
| profile_age | 0.0000000 | TRUE |
| session_length | 0.2317669 | FALSE |
| session_length_max | 0.4197083 | FALSE |
| TIME_TO_DOM_COMPLETE_MS | 0.0173453 | TRUE |
| TIME_TO_DOM_CONTENT_LOADED_END_MS | 0.0090137 | TRUE |
| TIME_TO_DOM_INTERACTIVE_MS | 0.0976424 | FALSE |
| TIME_TO_LOAD_EVENT_END_MS | 0.0489124 | TRUE |
| TIME_TO_NON_BLANK_PAINT_MS | 0.2094423 | FALSE |

Visual inspection

For a graphical comparison, we plot the covariate distributions for the following subsets:

  • Beta v68: pre-matching [blue]
  • Beta v68: matched and subsetted [pink]
  • Release v68 [purple]

NOTE: Guiding lines have been added for the following:

  • black solid: Release mean
  • black dashed: Release median
  • red dashed: subsetted Beta mean

Discussion

Our main objective was to determine whether the user engagement metrics changed in the newest Beta version relative to the previous Release version. From the preceding analysis, we can see significant differences between the two groups (Beta and Release) for only two user engagement metrics, listed as follows.

  • num_pages
  • num_pages_max

In addition, by analyzing the distributions of the training covariates, the most different covariates are listed as follows.

  • num_addons
  • profile_age
  • num_bookmarks
  • num_active_days

Overall Considerations

In this project, we employ statistical matching methods to find subsets of Beta users that can inform how Release, i.e., the general user community, will behave. Traditionally, statistical matching is used to evaluate the effect of a treatment by comparing treated and untreated units in an observational study: for each treated unit, the goal is to find one (or more) untreated unit(s) with similar observable characteristics, so that outcomes can be compared between the two groups to assess the effect of the treatment.

Here, we use statistical matching methods in a non-traditional way: our goal is to search for a subset of Beta clients that is representative of Release. In our application, we still have two cohorts, i.e., the Beta and Release data sets for a given Firefox version (N). However, we have no treatment and therefore no single outcome. Rather, we are trying to equate (or balance) the Beta and Release data sets so that they resemble each other across the covariates we are concerned with; this balance is, so to speak, our outcome.
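
As an illustration of the matching idea (the project's exact procedure is not reproduced here), a one-nearest-neighbor match on standardized covariates; the distance metric, standardization, and all names are assumptions:

```python
import numpy as np

def nearest_neighbor_match(release, beta):
    """For each Release row, return the index of the closest Beta row
    (Euclidean distance on covariates standardized with the pooled mean/std).
    Illustrative 1-NN matching with replacement, not the project's method."""
    pooled = np.vstack([release, beta])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
    sd[sd == 0] = 1.0                      # guard constant covariates
    r = (release - mu) / sd
    b = (beta - mu) / sd
    dists = np.linalg.norm(r[:, None, :] - b[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng = np.random.default_rng(1)
release = rng.normal(size=(5, 3))          # 5 Release clients, 3 covariates
beta = rng.normal(size=(8, 3))             # 8 candidate Beta clients
idx = nearest_neighbor_match(release, beta)
beta_subset = beta[np.unique(idx)]         # the Beta subset that resembles Release
```

The set of matched Beta rows plays the role of the "representative subset" discussed above.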

In this work, we focus on user engagement metrics as the chosen use-case. To validate the resulting matching model, we follow two different strategies. First, we balance the data sets on a set of training covariates (i.e., environment, machine configuration, performance, and usage metrics) and then look at the difference in the user engagement metrics between the balanced Beta and Release sets for the same Firefox version (v67). This gives us an idea of how clients with similar environments and performance resemble Release in terms of usage. Second, we balance the data sets across the covariates we are concerned with, but now over several versions (v67 and v68). This gives us an idea of how these users actually change over time.

For both strategies, we apply the same experimental setup; the only difference is which Firefox versions are compared. In the first strategy, our main objective was to show how Beta users differ from Release users in terms of user engagement, with all the other training covariates being equal. Our findings show that the matching worked well in general. However, for a subset of covariates the difference between channels actually increased (e.g., num_pages, num_pages_max, daily_unique_domains, and daily_unique_domains_max), or remained relatively different from Release both before and after matching (namely active_hours, active_hours_max, uri_count, and uri_count_max). In addition, we observe that these Beta users are very similar to Release users regarding search count and daily tabs opened.

In the second strategy, our main objective was to determine whether the user engagement metrics changed in the newest Beta version relative to the previous Release version. Overall, the matching yielded a subset that was about as representative for v68 as for v67 for most of the covariates reviewed. However, for a subset of covariates, the difference between channels actually decreased (num_pages, num_pages_max, daily_unique_domains, and daily_max_tabs).